Voice and text communication system, method and apparatus

ABSTRACT

The disclosure relates to systems, methods and apparatus to convert speech to text and vice versa. One apparatus comprises a vocoder, a speech to text conversion engine, a text to speech conversion engine, and a user interface. The vocoder is operable to convert speech signals into packets and convert packets into speech signals. The speech to text conversion engine is operable to convert speech to text. The text to speech conversion engine is operable to convert text to speech. The user interface is operable to receive a user selection of a mode from among a plurality of modes, wherein a first mode enables the speech to text conversion engine, a second mode enables the text to speech conversion engine, and a third mode enables the speech to text conversion engine and the text to speech conversion engine.

TECHNICAL FIELD

The disclosure relates to communications and, more particularly, to a voice and text communication system, method and apparatus.

BACKGROUND

A cellular phone may include an audio capture device, such as a microphone and/or speech synthesizer, and an audio encoder to generate audio packets or frames. The phone may use communication protocol layers and modules to transmit packets across a wireless communication channel to a network or another communication device.

SUMMARY

One aspect relates to an apparatus comprising a vocoder, a speech to text conversion engine, a text to speech conversion engine, and a user interface. The vocoder is operable to convert speech signals into packets and convert packets into speech signals. The speech to text conversion engine is operable to convert speech to text. The text to speech conversion engine is operable to convert text to speech. The user interface is operable to receive a user selection of a mode from among a plurality of modes, wherein a first mode enables the speech to text conversion engine, a second mode enables the text to speech conversion engine, and a third mode enables the speech to text conversion engine and the text to speech conversion engine.

Another aspect relates to an apparatus comprising: a vocoder operable to convert speech signals into packets and convert packets into speech signals; a speech to text conversion engine operable to convert speech to text; a user interface operable to receive a user selection of a mode from among a plurality of modes, wherein a first mode enables the vocoder, and a second mode enables the speech to text conversion engine; and a transceiver operable to wirelessly transmit encoded speech packets and text packets to a communication network.

Another aspect relates to a network apparatus comprising: a vocoder operable to convert packets into speech signals; a speech to text conversion engine operable to convert speech to text; a selection unit operable to switch between first and second modes, wherein the first mode enables the vocoder, and a second mode enables the vocoder and the speech to text conversion engine; and a transceiver operable to wirelessly transmit encoded speech packets and text packets to a communication network.

Another aspect relates to a method comprising: receiving encoded speech packets; converting the received encoded speech packets into speech signals; and receiving a user selection of a mode from among a plurality of modes, wherein a first mode enables speech to text conversion, a second mode enables text to speech conversion, and a third mode enables speech to text and text to speech conversion.

The details of one or more embodiments are set forth in the accompanying drawings and the description below.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 illustrates a system comprising a first communication device, a network, and a second communication device.

FIG. 2 illustrates a method of using the second device of FIG. 1.

FIG. 3 illustrates another configuration of the first communication device of FIG. 1.

FIG. 4 illustrates another configuration of the network of FIG. 1.

DETAILED DESCRIPTION

Receiving a call on a mobile device in a meeting, airplane, train, theater, restaurant, church or other place may be disruptive to others. It may be much less disruptive if a user could select another mode on the mobile device to receive the call and/or respond to the call. In one mode, the device receives the call and converts speech/voice signals to text without requiring the caller on the other end to input text.

FIG. 1 illustrates a system comprising a first communication device 100, a network 110, and a second communication device 120. The system may include other components. The system may use any type of wireless communication, such as Global System for Mobile communications (GSM), code division multiple access (CDMA), CDMA2000, CDMA2000 1x EV-DO, Wideband CDMA (WCDMA), orthogonal frequency division multiple access (OFDMA), Bluetooth, WiFi, WiMax, etc.

The first communication device 100 comprises a voice coder (vocoder) 102 and a transceiver 104. The first communication device 100 may include other components in addition to or instead of the components shown in FIG. 1. The first communication device 100 may represent or be implemented in a landline (non-wireless) phone, a wireless communication device, a personal digital assistant (PDA), a handheld device, a laptop computer, a desktop computer, a digital camera, a digital recording device, a network-enabled digital television, a mobile phone, a cellular phone, a satellite telephone, a camera phone, a terrestrial-based radiotelephone, a direct two-way communication device (sometimes referred to as a “walkie-talkie”), a camcorder, etc.

The vocoder 102 may include an encoder to encode speech signals into packets and a decoder to decode packets into speech signals. The vocoder 102 may be any type of vocoder, such as an enhanced variable rate coder (EVRC), Adaptive Multi-Rate (AMR), Fourth Generation vocoder (4GV), etc. Vocoders are described in co-assigned U.S. Pat. Nos. 6,397,175, 6,434,519, 6,438,518, 6,449,592, 6,456,964, 6,477,502, 6,584,438, 6,678,649, 6,691,084, 6,804,218, 6,947,888, which are hereby incorporated by reference.

The transceiver 104 may wirelessly transmit and receive packets containing encoded speech.

The network 110 may represent one or more base stations, base station controllers (BSCs), mobile switching centers (MSCs), etc. If the first device 100 is a landline phone, then network 110 may include components in a plain old telephone service (POTS) network. The network 110 comprises a vocoder 112 and a transceiver 114. The network 110 may include other components in addition to or instead of the components shown in FIG. 1.

The second communication device 120 may represent or be implemented in a wireless communication device, a personal digital assistant (PDA), a handheld device, a laptop computer, a desktop computer, a digital camera, a digital recording device, a network-enabled digital television, a mobile phone, a cellular phone, a satellite telephone, a camera phone, a terrestrial-based radiotelephone, a direct two-way communication device (sometimes referred to as a “walkie-talkie”), a camcorder, etc.

The second communication device 120 comprises a transceiver 124, a speech and text unit 140, a speaker 142, a display 128, a user input interface 130, e.g., a keypad, and a microphone 146. The speech and text unit 140 comprises a vocoder 122, a speech to text conversion engine 126, a controller 144, a text to speech conversion engine 132, and a voice synthesizer 134. The speech and text unit 140 may include other components in addition to or instead of the components shown in FIG. 1.

One or more of the components or functions in the speech and text unit 140 may be integrated into a single module, unit, component, or software. For example, the speech to text conversion engine 126 may be combined with the vocoder 122. The text to speech conversion engine 132 may be combined with the vocoder 122, such that text is converted into encoded speech packets. The voice synthesizer 134 may be combined with the vocoder 122 and/or the text to speech conversion engine 132.

The speech to text conversion engine 126 may convert voice/speech to text. The text to speech conversion engine 132 may convert text to speech. The controller 144 may control operations and parameters of one or more components in the speech and text unit 140.
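For readers who find a code view helpful, the relationship among these components can be sketched roughly as follows. The class and method names below are hypothetical illustrations only (the disclosure does not define a software interface); the sketch simply wires the converters together in the order described above.

```python
# Minimal sketch of the speech and text unit 140; all names are hypothetical.
from typing import Protocol


class Vocoder(Protocol):               # vocoder 122
    def encode(self, speech: bytes) -> bytes: ...
    def decode(self, packets: bytes) -> bytes: ...


class SpeechToText(Protocol):          # speech to text conversion engine 126
    def convert(self, speech: bytes) -> str: ...


class TextToSpeech(Protocol):          # text to speech conversion engine 132
    def convert(self, text: str) -> bytes: ...


class VoiceSynthesizer(Protocol):      # voice synthesizer 134
    def personalize(self, speech: bytes) -> bytes: ...


class SpeechTextUnit:
    """Illustrative wrapper for unit 140 that chains the converters."""

    def __init__(self, vocoder: Vocoder, stt: SpeechToText,
                 tts: TextToSpeech, synth: VoiceSynthesizer) -> None:
        self.vocoder, self.stt, self.tts, self.synth = vocoder, stt, tts, synth

    def incoming_packets_to_text(self, packets: bytes) -> str:
        """Receive path for the third/fourth modes: packets -> speech -> text."""
        return self.stt.convert(self.vocoder.decode(packets))

    def outgoing_text_to_packets(self, text: str) -> bytes:
        """Transmit path for the second/fourth modes: text -> speech -> packets."""
        speech = self.synth.personalize(self.tts.convert(text))
        return self.vocoder.encode(speech)
```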

The device 120 may provide several modes of communication for a user to receive calls and/or respond to calls, as shown in the table below and in FIG. 2.

Mode           Listen                                   Speak
Normal mode    Yes                                      Yes
Second mode    Yes                                      No - transmit text or synthesized speech
Third mode     No - convert incoming speech to text     Yes
Fourth mode    No - convert incoming speech to text     No - transmit text or synthesized speech

In a normal mode (blocks 202 and 210), the user of the second device 120 receives a call from the first device 100, listens to speech from the speaker 142, and speaks into the microphone 146.
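In code terms, the four modes reduce to two independent choices: whether incoming speech is played through the speaker 142 or converted to text, and whether the reply originates from the microphone 146 or the keypad 130. The following sketch is one hypothetical way the device 120 could represent that; the enum and helper names are illustrative, not part of the disclosure.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = 1   # listen and speak (blocks 202/210)
    SECOND = 2   # listen; reply with text or synthesized speech (blocks 204/212)
    THIRD = 3    # incoming speech shown as text; reply by speaking (blocks 206/214)
    FOURTH = 4   # incoming speech shown as text; reply with text (blocks 208/216)


def convert_incoming_to_text(mode: Mode) -> bool:
    """True when received speech is routed to the speech-to-text engine 126."""
    return mode in (Mode.THIRD, Mode.FOURTH)


def reply_from_text(mode: Mode) -> bool:
    """True when the reply originates from the keypad 130 rather than the microphone 146."""
    return mode in (Mode.SECOND, Mode.FOURTH)
```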

FIG. 2 illustrates a method of using the second device 120 of FIG. 1. When the second device 120 receives a call from the first device 100, a user of the second device 120 can select one of the modes via the user interface 130 in block 200. Alternatively, the user may switch between modes in block 200 before the second device 120 receives a call from another device. For example, if the user of the second device 120 enters a meeting, airplane, train, theater, restaurant, church or other place where incoming calls may be disruptive to others, the user may switch from the normal mode to one of the other three modes.

In a second mode (blocks 204 and 212), the user of the second device 120 may listen to speech from the first device 100, such as using an earpiece, headset, or headphones, but not talk. Instead, the user of the second device 120 may type on the keypad 130 or use a writing stylus to enter handwritten text on the display 128. The display 128 or the text to speech conversion engine 132 may have a module that recognizes handwritten text and characters. The device 120 may (a) send the text to the first device 100 or (b) convert the text to speech with the text to speech conversion engine 132.
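A minimal sketch of that branch might look as follows, assuming a hypothetical transport object with send_text() and send_packets() methods; which branch is taken could depend on what the far end and the network support.

```python
def handle_reply_text(text, unit, transport, send_as_text):
    """Second-mode reply path for device 120 (hypothetical API).

    `unit` plays the role of the speech and text unit 140 and is assumed to
    expose outgoing_text_to_packets(); `transport` is assumed to expose
    send_text() and send_packets().
    """
    if send_as_text:
        transport.send_text(text)                      # option (a): send text to device 100
    else:
        packets = unit.outgoing_text_to_packets(text)  # option (b): text -> speech -> packets
        transport.send_packets(packets)
```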

The voice synthesizer 134 may synthesize the speech to produce personalized speech signals to substantially match the user's natural voice. The voice synthesizer 134 may include a memory that stores characteristics of the user's voice, such as pitch. A voice synthesizer is described in co-assigned U.S. Pat. No. 6,950,799, which is incorporated by reference. Another voice synthesizer is described in co-assigned U.S. patent application Ser. No. 11/398,364, which is incorporated by reference.
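As a rough illustration, the stored characteristics can be thought of as a small parameter record such as the hypothetical one below; the fields are examples of the kind of pitch and spectral information mentioned above, not a description of the referenced synthesizers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VoiceProfile:
    """Hypothetical record of stored voice characteristics (voice synthesizer 134)."""
    average_pitch_hz: float                  # speaker's typical fundamental frequency
    pitch_range_hz: float                    # typical pitch excursion
    spectral_envelope: List[float] = field(default_factory=list)  # e.g. LPC/LSF-style coefficients
    speaking_rate: float = 1.0               # relative tempo of the speaker
```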

The vocoder 122 encodes the speech into packets. There may or may not be a short delay. In one configuration, other than a short time delay, communication with the second device 120 may appear seamless to the user of the first device 100. If the user of the second device 120 is in a meeting, the conversation may be more message-based than seamless.

In third and fourth modes (blocks 206, 208, 214 and 216), the device 120 receives a call, and the speech to text conversion engine 126 converts speech/voice signals to text for display on the display 128. In one configuration, the third and fourth modes may allow the user of the first device 100 to continue talking and not require the user of the first device 100 to switch to a text input mode. The speech to text conversion engine 126 may include a voice recognition module to recognize words and sounds to convert them to text.
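One hypothetical wiring of this receive path is sketched below; the packet loop and display call are illustrative stand-ins for the actual decoding and rendering pipeline.

```python
def receive_call_as_text(packet_stream, unit, display):
    """Third/fourth-mode receive path (hypothetical API).

    `packet_stream` is an iterable of encoded speech packets, `unit` stands in
    for the speech and text unit 140, and `display` for display 128.
    """
    for packets in packet_stream:
        text = unit.incoming_packets_to_text(packets)  # vocoder 122, then STT engine 126
        display.show(text)
```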

In the third mode, the device 120 allows the user to speak into the microphone 146, which passes speech to the vocoder 122 to encode into packets.

In the fourth mode, the user of the second device 120 may type on the keypad 130 or use a writing stylus to enter handwritten text on the display 128. The device 120 may (a) send the text to the first device 100 or (b) convert the text to speech with the text to speech conversion engine 132. The voice synthesizer 134 may synthesize the speech to produce personalized speech signals to substantially match the user's natural voice. The vocoder 122 encodes the speech into packets.

In the second and fourth modes, if the second device 120 is set to convert text to speech and synthesize speech, there may be a time delay between when the second device 120 accepts a call from the first device 100 and when the first device 100 receives speech packets. The second device 120 may be configured to play a pre-recorded message by the user to inform the first device 100 that the user of the second device 120 is in a meeting and will respond using text to speech conversion.

The second and fourth modes may provide one or more advantages, such as transmitting speech without background noise, no need or reduced need for echo cancellation, no need or reduced need for noise suppression, faster encoding, less processing, etc.

FIG. 1 shows an example where changes (new functions and/or elements) may be implemented in only the second communication device 120. To realize the new modes (second, third and fourth modes) of communication, the second communication device 120 has a vocoder 122, a speech-to-text engine 126, a text-to-speech engine 132, etc. With this device 120, the system can support the new modes without any changes in the network 110 and conventional phones 100 (landline, mobile phones, etc.). The device 120 may receive and send voice packets regardless of the mode selected by the user.

FIG. 3 illustrates another configuration 100A of the first communication device 100 of FIG. 1. In FIG. 3, the first communication device 100A comprises a speech to text conversion engine 300, an encoder 302, a transceiver 104, a decoder 304, and a user interface 330. The speech to text conversion engine 300 may convert voice/speech to text to be transmitted by the transceiver 104 to the network 110. The first communication device 100A of FIG. 3 may allow the second device 120 to be designed without a speech to text conversion engine 126. The first communication device 100A of FIG. 3 may save bandwidth by sending text instead of speech to the network 110. The user interface 330 may be operable to receive a user selection of a mode from among a plurality of modes, wherein a first mode enables the vocoder 302, 304, and a second mode enables the speech to text conversion engine 300.
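To make the bandwidth point concrete: a vocoder running at, say, 8 kbit/s produces roughly 1 kB per second of speech, while the transcript of that second of speech is typically only a few words, i.e., tens of bytes. The sketch below shows the sender-side choice in device 100A under those assumptions; the rate figure and the API names are illustrative, not taken from the disclosure.

```python
def transmit_from_device_100a(speech, mode, stt_engine, encoder, transceiver):
    """Sender-side choice in device 100A (hypothetical names).

    stt_engine ~ speech to text conversion engine 300, encoder ~ encoder 302,
    transceiver ~ transceiver 104.
    """
    if mode == "text":                                        # second mode of device 100A
        transceiver.send_text(stt_engine.convert(speech))     # tens of bytes per second of speech
    else:                                                     # first mode: normal vocoded speech
        transceiver.send_packets(encoder.encode(speech))      # on the order of 1 kB per second
```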

FIG. 4 illustrates another configuration 110A of the network 110 of FIG. 1. In FIG. 4, the network 110A comprises a voice coder/decoder 400, a transceiver 114 and a speech to text conversion engine 402. In another configuration, the network 110A may further comprise a text to speech conversion engine 404, a voice synthesizer 402 and a controller 444. The vocoder 400 decodes speech packets to provide speech signals. The speech to text conversion engine 402 may convert voice/speech to text to be transmitted by the transceiver 114 to the second device 120. The network 110A of FIG. 4 may allow the second device 120 to be designed without a speech to text conversion engine 126 or allow the speech to text conversion engine 126 to be deactivated. The network 110A of FIG. 4 may save bandwidth by sending text instead of speech to the second device 120.

The network 110A in FIG. 4 may acquire knowledge of a configuration, situation or preference of the receiving device 120. If the network 110A realizes that the receiving device 120 will not benefit from receiving voice packets (e.g., by sensing a user preference or the place of the call, for example, an extremely noisy environment where it is difficult to listen to received speech), then the network 110A will transform voice packets to text packets. Even if the receiving device 120 has the ability to change voice packets to text packets (using a speech-to-text engine 126), it can be a waste of bandwidth and device power to do this transformation (from voice to text) if the user is in a text-receiving mode (a meeting, or silent communication in general).
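One hypothetical way to express the network-side decision is sketched below; how the network learns the receiver's preference (in-band signaling, a stored profile, etc.) is left as an assumption.

```python
def forward_to_receiver(voice_packets, receiver_prefers_text, vocoder, stt_engine, transceiver):
    """Forwarding decision in network 110A (hypothetical names).

    vocoder ~ voice coder/decoder 400, stt_engine ~ speech-to-text engine 402,
    transceiver ~ transceiver 114.
    """
    if receiver_prefers_text:
        # Converting in the network spares the handset the bandwidth and power
        # cost of receiving voice packets it would only turn into text anyway.
        speech = vocoder.decode(voice_packets)
        transceiver.send_text(stt_engine.convert(speech))
    else:
        transceiver.send_packets(voice_packets)   # pass voice through unchanged
```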

Thus, the network 110A in FIG. 4 may be used in a system where changes (new features and/or elements) are implemented only in the network 110A, i.e., no changes in communication devices or handsets. The network 110A may take care of changing voice packets into text and vice versa where the mobile handsets do not have speech to text conversion units, or where the handsets have such units but prefer not to do the conversion or cannot do the conversion due to a lack of computational resources, battery power, etc.

For example, the first device 100 in FIG. 1 can send/receive voice packets (i.e., first mode), while the second device 120 sends/receives text (i.e., fourth mode). The second device 120 may not have unit 140 (or just have a vocoder 122) or may have unit 140 deactivated. To allow the second device 120 to operate in the fourth mode, the network 110A in FIG. 4 will change the first device's voice packets into text packets (using the speech-to-text engine 402) to send to the second device 120 and will change text packets from the second device 120 to voice packets (using the text-to-speech engine 404) to send to the first device 100.

If the second device 120 does not have the unit 140, the second device 120 can signal (in-band, for example) a desired mode to the network 110A and thus ask the network 110A to convert between speech and text, i.e., perform the functions of unit 140.

Personalized speech synthesis may be done in the network 110A. As described above, the unit 140 in FIG. 1 has a voice synthesizer 134 to change the output of the text-to-speech engine 132 to personalized speech (the user's voice). In a system with the network 110A of FIG. 4, to produce voice packets that carry a voice signature of the user of the second device 120, the second device 120 may send stored voice packets (at the beginning of using the second or fourth modes) that have the spectral parameters and pitch information of the user to the network 110A. These few transmitted voice packets (preceding the text packets) can be used by the network 110A to produce personalized voice packets.

An example of transmitting packets for the second or fourth modes from the second device 120 to the network 110A is described. At the beginning of using these “text modes” (second or fourth modes), the second device 120 transmits to the network 110A the user's pre-stored voice packets (N packets) plus a mode of operation (1, 2, 3, or 4; a request to do the conversion). The second device 120 may then send text packets.
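The ordering can be summarized in a short sketch. Only the sequence (voice packets first, then the mode indicator, then text) comes from the description above; the helper names and framing are assumptions.

```python
def start_text_mode_session(transceiver, stored_voice_packets, mode, text_packets):
    """Device 120 to network 110A at the start of a second- or fourth-mode session.

    Helper names (send_packets/send_mode/send_text) are hypothetical; only the
    ordering mirrors the description above.
    """
    for pkt in stored_voice_packets:     # N pre-stored voice packets carrying the
        transceiver.send_packets(pkt)    # user's spectral parameters and pitch
    transceiver.send_mode(mode)          # requested mode of operation (1, 2, 3, or 4)
    for pkt in text_packets:             # subsequent traffic is text
        transceiver.send_text(pkt)
```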

A combination of the two configurations (FIG. 1 and FIG. 4) is also possible. When using one of these modes, the network 110A will enable the text/speech conversion after sensing (e.g., receiving a request via signaling) the capability of the receiving device 120, which may perform the conversion itself or let the network 110A or the device 100A perform the conversion.

One or more components and features described above may be implemented in a push to talk (PTT) or push to read communication device. A PTT device allows a user to push a button on the device and talk, while the device converts speech to text and transmits text packets to a network or directly to another communication device. PTT communication is “message based,” rather than continuous like a standard voice call. A time period over which a user holds down the PTT button on the device may nicely frame the message that is then converted to text, etc.
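A push-to-read exchange could be framed roughly as in the sketch below, with the button press and release delimiting one message; the capture and conversion calls are hypothetical placeholders.

```python
def push_to_read_message(microphone, stt_engine, transceiver):
    """One push-to-read message: record while the button is held, then send text.

    record_while_pressed(), convert(), and send_text() are illustrative
    stand-ins; the point is that the press/release interval frames the message.
    """
    speech = microphone.record_while_pressed()    # button press .. release
    text = stt_engine.convert(speech)             # speech -> text
    transceiver.send_text(text)                   # one message-based text burst
```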

The device 120 may have a dedicated memory for storing instructions and data, as well as dedicated hardware, software, firmware, or combinations thereof. If implemented in software, the techniques may be embodied as instructions on a computer-readable medium such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), FLASH memory, or the like. The instructions cause one or more processors to perform certain aspects of the functionality described in this disclosure.

The techniques described in this disclosure may be implemented within a general purpose microprocessor, digital signal processor (DSP), application specific integrated circuit (ASIC), field programmable gate array (FPGA), or other equivalent logic devices. For example, the speech and text unit 140 and associated components and modules may be implemented as parts of an encoding process, or coding/decoding (CODEC) process, running on a digital signal processor (DSP) or other processing device. Accordingly, components described as modules may form programmable features of such a process, or a separate process.

The speech and text unit 140 may have a dedicated memory for storing instructions and data, as well as dedicated hardware, software, firmware, or combinations thereof. If implemented in software, the techniques may be embodied as instructions executable by one or more processors. The instructions may be stored on a computer-readable medium such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), FLASH memory, a magnetic or optical data storage device, or the like. The instructions cause one or more processors to perform certain aspects of the functionality described in this disclosure.

Various embodiments have been described. These and other embodiments are within the scope of the following claims.

1. An apparatus comprising: a vocoder operable to convert speech signals into packets and convert packets into speech signals; a speech to text conversion engine operable to convert speech to text; a text to speech conversion engine operable to convert text to speech; and a user interface operable to receive a user selection of a mode from among a plurality of modes, wherein a first mode enables the speech to text conversion engine, a second mode enables the text to speech conversion engine, and a third mode enables the speech to text conversion engine and the text to speech conversion engine.
 2. The apparatus of claim 1, further comprising a display to display text from the speech to text conversion engine.
 3. The apparatus of claim 1, further comprising a keypad to receive input text from a user.
 4. The apparatus of claim 1, wherein the user interface is operable to receive a user selection of a mode before the apparatus receives a call from another apparatus.
 5. The apparatus of claim 1, wherein the user interface is operable to receive a user selection of a mode after the apparatus receives a call from another apparatus.
 6. The apparatus of claim 1, further comprising a voice synthesizer to synthesize a user's voice.
 7. The apparatus of claim 1, further comprising a transceiver operable to wirelessly transmit encoded speech packets and text packets to a communication network.
 8. An apparatus comprising: a vocoder operable to convert speech signals into packets and convert packets into speech signals; a speech to text conversion engine operable to convert speech to text; a user interface operable to receive a user selection of a mode from among a plurality of modes, wherein a first mode enables the vocoder, and a second mode enables the speech to text conversion engine; and a transceiver operable to wirelessly transmit encoded speech packets and text packets to a communication network.
 9. The apparatus of claim 8, further comprising a display to display text from the speech to text conversion engine.
 10. The apparatus of claim 8, further comprising a keypad to receive input text from a user.
 11. The apparatus of claim 8, wherein the user interface is operable to receive a user selection of a mode before the apparatus receives a call from another apparatus.
 12. The apparatus of claim 8, wherein the user interface is operable to receive a user selection of a mode after the apparatus receives a call from another apparatus.
 13. A network apparatus comprising: a vocoder operable to convert packets into speech signals; a speech to text conversion engine operable to convert speech to text; a selection unit operable to switch between first and second modes, wherein the first mode enables the vocoder, and a second mode enables the vocoder and the speech to text conversion engine; and a transceiver operable to wirelessly transmit encoded speech packets and text packets to a communication network.
 14. The network apparatus of claim 13, further comprising a text to speech conversion engine operable to convert text to speech, wherein the selection unit is operable to switch to a third mode where the vocoder and both conversion engines are enabled.
 15. The network apparatus of claim 14, further comprising a voice synthesizer operable to synthesize a user's voice from text converted to speech.
 16. The network apparatus of claim 15, wherein the voice synthesizer is operable to receive and store voice characteristics of a user's voice.
 17. The network apparatus of claim 13, further comprising a controller operable to receive a request from a communication device to convert speech to text.
 18. The network apparatus of claim 13, further comprising a controller operable to receive a request from a communication device to convert text to speech.
 19. A method comprising: receiving encoded speech packets; converting the received encoded speech packets into speech signals; and receiving a user selection of a mode from among a plurality of modes, wherein a first mode enables speech to text conversion, a second mode enables text to speech conversion, and a third mode enables speech to text and text to speech conversion.
 20. The method of claim 19, further comprising receiving a user selection for a mode before receiving an incoming call.
 21. The method of claim 19, further comprising receiving a user selection for a mode after receiving an incoming call.